431 Class 10

Thomas E. Love, Ph.D.

2023-09-28

Our Agenda

  • Ingesting the favorite movies data
  • Cleaning and Managing the data
  • Addressing Your Exploratory Questions from the Class 8 Breakout

Today’s R Packages

library(googlesheets4)
library(broom)
library(ggrepel)
library(ggridges)
library(gt)
library(mosaic)
library(janitor); library(naniar); library(patchwork)
library(tidyverse)

theme_set(theme_bw())
options(width = 70)
knitr::opts_chunk$set(comment = NA)
  • #| message: false silences messages here.

Ingesting the Data

Our Google Sheet

Ingesting from our Google Sheet

gs4_deauth()

movies23 <- 
  read_sheet("https://docs.google.com/spreadsheets/d/1qJnQWSjXyOXFZOO8VZixpgZWbraUW66SfP5hE5bKW4k") |>
  select(film_id, film, year, length, 
         imdb_ratings, imdb_stars, imdb_categories) |>
  mutate(film_id = as.character(film_id))

dim(movies23)
[1] 201   7
names(movies23)
[1] "film_id"         "film"            "year"           
[4] "length"          "imdb_ratings"    "imdb_stars"     
[7] "imdb_categories"

The favorite movies data

movies23
# A tibble: 201 × 7
   film_id film    year length imdb_ratings imdb_stars imdb_categories
   <chr>   <chr>  <dbl>  <dbl>        <dbl>      <dbl> <chr>          
 1 1       3 Idi…  2009    170       417000        8.4 Comedy, Drama  
 2 2       8 1/2   1963    138       122000        8   Drama          
 3 3       10 Th…  1999     97       366000        7.3 Comedy, Drama,…
 4 4       2001:…  1968    149       697000        8.3 Adventure, Sci…
 5 5       About…  2009    119        56000        7.9 Drama, Mystery 
 6 6       About…  2013    123       371000        7.8 Comedy, Drama,…
 7 7       Alien   1979    117       918000        8.5 Horror, Sci-Fi 
 8 8       Amade…  1984    160       416000        8.4 Biography, Dra…
 9 9       Avatar  2009    162      1400000        7.9 Action, Advent…
10 10      Aveng…  2018    149      1200000        8.4 Action, Advent…
# ℹ 191 more rows

Broad Summary

movies23 |> summary()
   film_id              film                year          length     
 Length:201         Length:201         Min.   :1942   Min.   : 70.0  
 Class :character   Class :character   1st Qu.:1996   1st Qu.:104.0  
 Mode  :character   Mode  :character   Median :2006   Median :118.0  
                                       Mean   :2003   Mean   :123.8  
                                       3rd Qu.:2013   3rd Qu.:138.0  
                                       Max.   :2023   Max.   :207.0  
  imdb_ratings       imdb_stars    imdb_categories   
 Min.   :   4300   Min.   :3.400   Length:201        
 1st Qu.: 157000   1st Qu.:7.100   Class :character  
 Median : 357000   Median :7.800   Mode  :character  
 Mean   : 543464   Mean   :7.563                     
 3rd Qu.: 769000   3rd Qu.:8.100                     
 Max.   :2800000   Max.   :9.300                     
pct_complete_case(movies23)  ## from naniar
[1] 100
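
As a sketch of what that 100 means: pct_complete_case() reports the percentage of rows with no missing values. The toy data frame below is made up for illustration (it is not the movies data), and the calculation uses base R's complete.cases() rather than naniar.

```r
# Toy data (hypothetical, not from movies23): one row has a missing length
toy <- data.frame(
  film   = c("A", "B", "C", "D"),
  year   = c(1999, 2005, 2012, 2020),
  length = c(101, NA, 125, 97)
)

# complete.cases() is TRUE for rows with no NA in any column
complete_rows <- complete.cases(toy)

# Percentage of complete rows: 3 of the 4 rows here
pct_complete <- 100 * mean(complete_rows)
pct_complete
```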

Your Questions (1-4)

  1. Are movies getting longer? (year, length)
  2. Which categories/genres have higher ratings? (imdb_categories, imdb_stars)
  3. Are longer movies rated more highly? (length, imdb_stars)
  4. Which categories/genres have more ratings? (imdb_categories, imdb_ratings)

See this link for more details.

Your Questions (5-8)

  5. Do more recent movies get more ratings? (year, imdb_ratings)
  6. Do more recent movies have higher ratings? (year, imdb_stars)
  7. Are ratings and stars associated? (imdb_ratings, imdb_stars)
  8. Which years have the most movies in our sample? (year)

See this link for more details.

Exploring and Cleaning Data

Basic Exploration: year

p1 <- ggplot(data = movies23, aes(x = year)) +
  geom_histogram(binwidth = 5, fill = "royalblue", col = "white") + 
  labs(x = "Year of Release", y = "Number of Movies")

p2 <- ggplot(data = movies23, aes(x = year, y = "")) +
  geom_violin() +
  geom_boxplot(fill = "royalblue", width = 0.3,
               outlier.color = "royalblue", outlier.size = 3) +
  stat_summary(fun = "mean", geom = "point",
               shape = 23, size = 3, fill = "white") +
  labs(y = "", x = "Year of Release")

p1 / p2 + plot_layout(heights = c(2,1))

Normal Q-Q plot for year

ggplot(data = movies23, aes(sample = year)) +
  geom_qq(col = "royalblue") + geom_qq_line(col = "red") +
  theme(aspect.ratio = 1) +
  labs(x = "Expected N(0,1)", y = "Year of Release")
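
To see what a Normal Q-Q plot is plotting, here is a minimal base R sketch on simulated data (the vector x is made up; it stands in for year). The sorted data go on the vertical axis, and the matching standard Normal quantiles, qnorm(ppoints(n)), go on the horizontal axis.

```r
set.seed(431)
x <- rnorm(50, mean = 2003, sd = 14)   # simulated stand-in for year

n <- length(x)
theoretical <- qnorm(ppoints(n))   # expected N(0,1) quantiles
observed    <- sort(x)             # ordered sample values

# Base R's qqnorm() computes the same pairs (plot.it = FALSE skips drawing)
qq <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
all.equal(sort(qq$y), observed)
```

If the points fall close to a straight line, a Normal model is a reasonable approximation for the data.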

Consider age = 2023-year

movies23 <- movies23 |> mutate(age = 2023 - year)

p1 <- ggplot(data = movies23, aes(sample = age)) +
  geom_qq(col = "steelblue") + geom_qq_line(col = "red") +
  theme(aspect.ratio = 1) +
  labs(x = "Expected N(0,1)", y = "Years Since Release")

p2 <- ggplot(data = movies23, aes(x = age)) +
  geom_histogram(bins = 10, fill = "steelblue", col = "white") + 
  labs(x = "Years Since Release", y = "Number of Movies")

p3 <- ggplot(data = movies23, aes(x = age, y = "")) +
  geom_violin() +
  geom_boxplot(fill = "steelblue", width = 0.3,
               outlier.color = "steelblue", outlier.size = 3) +
  stat_summary(fun = "mean", geom = "point",
               shape = 23, size = 3, fill = "white") +
  labs(y = "", x = "Years Since Release")

p1 + (p2 / p3 + plot_layout(heights = c(2,1)))

Consider \(\log(age + 1)\)

## add 1 to age, so all values are strictly positive
## otherwise log(0) is -Inf for movies released in 2023,
## and ggplot2 drops those non-finite values from the plots

p1 <- ggplot(data = movies23, aes(sample = log(age+1))) +
  geom_qq(col = "cornflowerblue") + geom_qq_line(col = "red") +
  theme(aspect.ratio = 1) +
  labs(x = "Expected N(0,1)", y = "log(Years Since Release + 1)")

p2 <- ggplot(data = movies23, aes(x = log(age + 1))) +
  geom_histogram(bins = 10, fill = "cornflowerblue", col = "white") + 
  labs(x = "log(Years Since Release + 1)", y = "Number of Movies")

p3 <- ggplot(data = movies23, aes(x = log(age + 1), y = "")) +
  geom_violin() +
  geom_boxplot(fill = "cornflowerblue", width = 0.3,
               outlier.color = "cornflowerblue", outlier.size = 3) +
  stat_summary(fun = "mean", geom = "point",
               shape = 23, size = 3, fill = "white") +
  labs(y = "", x = "log(Years Since Release + 1)")

p1 + (p2 / p3 + plot_layout(heights = c(2,1)))
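
A small base R check of what happens with and without the + 1, using made-up ages (a movie released in 2023 has age 0):

```r
age <- c(0, 0, 1, 5, 20, 81)   # hypothetical ages; 0 means a 2023 release

log(age)        # the two zeros become -Inf
log(age + 1)    # all finite; log(0 + 1) is 0
log1p(age)      # base R shortcut for log(1 + x)

is.finite(log(age))
all.equal(log(age + 1), log1p(age))
```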

Consider \(\sqrt{age}\) = square root

## Square root of 0 is just zero, so we're OK to plot sqrt(age)

p1 <- ggplot(data = movies23, aes(sample = sqrt(age))) +
  geom_qq(col = "slateblue") + geom_qq_line(col = "red") +
  theme(aspect.ratio = 1) +
  labs(x = "Expected N(0,1)", y = "sqrt(Years Since Release)")

p2 <- ggplot(data = movies23, aes(x = sqrt(age))) +
  geom_histogram(bins = 10, fill = "slateblue", col = "white") + 
  labs(x = "sqrt(Years Since Release)", y = "Number of Movies")

p3 <- ggplot(data = movies23, aes(x = sqrt(age), y = "")) +
  geom_violin() +
  geom_boxplot(fill = "slateblue", width = 0.3,
               outlier.color = "slateblue", outlier.size = 3) +
  stat_summary(fun = "mean", geom = "point",
               shape = 23, size = 3, fill = "white") +
  labs(y = "", x = "sqrt(Years Since Release)")

p1 + (p2 / p3 + plot_layout(heights = c(2,1)))

Some Numerical Summaries for year

favstats(~ year, data = movies23)
  min   Q1 median   Q3  max    mean       sd   n missing
 1942 1996   2006 2013 2023 2002.98 14.39512 201       0
Hmisc::describe(movies23$year)
movies23$year 
       n  missing distinct     Info     Mean      Gmd      .05 
     201        0       54    0.999     2003     15.4     1977 
     .10      .25      .50      .75      .90      .95 
    1986     1996     2006     2013     2018     2019 

lowest : 1942 1954 1955 1963 1964, highest: 2019 2020 2021 2022 2023

What’s the mode?

The mode is the most frequently observed value. Counting movies by year addresses your Question 8: 2010 and 2016 tie for the mode, with 9 movies each.

movies23 |> count(year) |> arrange(desc(n))
# A tibble: 54 × 2
    year     n
   <dbl> <int>
 1  2010     9
 2  2016     9
 3  2012     8
 4  2018     8
 5  2019     8
 6  2001     7
 7  2007     7
 8  2008     7
 9  2009     7
10  2011     7
# ℹ 44 more rows
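
R has no built-in function for the statistical mode (mode() reports an object's storage type instead). A minimal sketch of a helper, here called stat_mode (a hypothetical name), that returns every value tied for most frequent:

```r
# Return all values tied for the highest frequency
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

# Made-up years, chosen so that two values tie
years <- c(2010, 2016, 2010, 2012, 2016, 2001)
stat_mode(years)
```

Returning all tied values matters here, since a single "the mode" would hide the tie.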

Oldest and Newest Movies?

movies23 |> select(film, year) |> arrange(desc(year)) |> head(6)
# A tibble: 6 × 2
  film                                 year
  <chr>                               <dbl>
1 Barbie                               2023
2 The Little Mermaid (2023)            2023
3 Everything, Everywhere, All at Once  2022
4 Loving Adults                        2022
5 A Man Called Otto                    2022
6 Thor: Love and Thunder               2022
movies23 |> select(film, year) |> arrange(year) |> head(4)
# A tibble: 4 × 2
  film             year
  <chr>           <dbl>
1 Casablanca       1942
2 Seven Samurai    1954
3 Pather Panchali  1955
4 8 1/2            1963

Additional Summaries for year

movies23 |> summarise(skew1 = (mean(year) - median(year))/sd(year))
# A tibble: 1 × 1
   skew1
   <dbl>
1 -0.210
movies23 |> count(year >= mean(year) - sd(year) &
                    year <= mean(year) + sd(year))
# A tibble: 2 × 2
  year >= mean(year) - sd(year) & year <= mean(year) + sd(year…¹     n
  <lgl>                                                          <int>
1 FALSE                                                             54
2 TRUE                                                             147
# ℹ abbreviated name:
#   ¹​`year >= mean(year) - sd(year) & year <= mean(year) + sd(year)`
147/201
[1] 0.7313433
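
For reference, skew1 compares the mean to the median in standard deviation units, and the count checks what share of years fall within one SD of the mean. A Normal distribution would put about 68% there. A sketch using the values reported by favstats() above and the TRUE count (147 of 201) from the table:

```r
# Values copied from the output above
mean_year   <- 2002.98
median_year <- 2006
sd_year     <- 14.39512

skew1 <- (mean_year - median_year) / sd_year
round(skew1, 3)            # negative: the mean sits below the median

within_1sd <- 147 / 201    # TRUE count / total movies
normal_benchmark <- pnorm(1) - pnorm(-1)   # share within 1 SD under a Normal model
c(observed = within_1sd, normal = normal_benchmark)
```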

Some Summaries for sqrt(age)

favstats(~ sqrt(age), data = movies23)
 min       Q1   median       Q3 max     mean       sd   n missing
   0 3.162278 4.123106 5.196152   9 4.179055 1.602553 201       0
Hmisc::describe(sqrt(movies23$age))
sqrt(movies23$age) 
       n  missing distinct     Info     Mean      Gmd      .05 
     201        0       54    0.999    4.179    1.805    2.000 
     .10      .25      .50      .75      .90      .95 
   2.236    3.162    4.123    5.196    6.083    6.782 

lowest : 0       1       1.41421 1.73205 2      
highest: 7.68115 7.74597 8.24621 8.30662 9      

Additional Summaries for sqrt(age)

movies23 |> 
  summarise(skew1 = (mean(sqrt(age)) - median(sqrt(age)))/sd(sqrt(age)))
# A tibble: 1 × 1
   skew1
   <dbl>
1 0.0349
movies23 |> count(sqrt(age) >= mean(sqrt(age)) - sd(sqrt(age)) &
                    sqrt(age) <= mean(sqrt(age)) + sd(sqrt(age)))
# A tibble: 2 × 2
  `&...`     n
  <lgl>  <int>
1 FALSE     65
2 TRUE     136
136/201
[1] 0.6766169

Basic Exploration: length

Summarizing length

favstats(~ length, data = movies23)
 min  Q1 median  Q3 max     mean      sd   n missing
  70 104    118 138 207 123.8259 25.5426 201       0
Hmisc::describe(movies23$length)
movies23$length 
       n  missing distinct     Info     Mean      Gmd      .05 
     201        0       83        1    123.8    28.18       94 
     .10      .25      .50      .75      .90      .95 
      97      104      118      138      162      171 

lowest :  70  83  90  91  92, highest: 181 189 197 201 207

Longest / Shortest Movies?

movies23 |> select(film, length) |> arrange(desc(length)) |> head(3)
# A tibble: 3 × 2
  film                                      length
  <chr>                                      <dbl>
1 Seven Samurai                                207
2 Lord of the Rings: The Return of the King    201
3 Doctor Zhivago                               197
movies23 |> select(film, length) |> arrange(length) |> head(3)
# A tibble: 3 × 2
  film                               length
  <chr>                               <dbl>
1 The Gingerdead Man                     70
2 The Little Mermaid (1989)              83
3 Gifted Hands: The Ben Carson Story     90

Exploring imdb_ratings

Summaries for imdb_ratings

favstats(~ imdb_ratings, data = movies23)
  min     Q1 median     Q3     max     mean       sd   n missing
 4300 157000 357000 769000 2800000 543464.2 539744.7 201       0
Hmisc::describe(movies23$imdb_ratings)
movies23$imdb_ratings 
       n  missing distinct     Info     Mean      Gmd      .05 
     201        0      164        1   543464   547653    36000 
     .10      .25      .50      .75      .90      .95 
   65000   157000   357000   769000  1300000  1700000 

lowest :    4300    7000   12000   13000   15000
highest: 1900000 2000000 2100000 2500000 2800000

Most and Least rated movies?

movies23 |> select(film, imdb_ratings) |> arrange(desc(imdb_ratings)) |> head(5)
# A tibble: 5 × 2
  film                     imdb_ratings
  <chr>                           <dbl>
1 The Dark Knight               2800000
2 The Shawshank Redemption      2800000
3 Inception                     2500000
4 Pulp Fiction                  2100000
5 Interstellar                  2000000
movies23 |> select(film, imdb_ratings) |> arrange(imdb_ratings) |> head(5)
# A tibble: 5 × 2
  film                 imdb_ratings
  <chr>                       <dbl>
1 The Gingerdead Man           4300
2 House Party 2                7000
3 Loving Adults               12000
4 Brideshead Revisited        13000
5 Madea Goes To Jail          13000

Exploring imdb_stars

Summaries for imdb_stars

favstats(~ imdb_stars, data = movies23)
 min  Q1 median  Q3 max     mean        sd   n missing
 3.4 7.1    7.8 8.1 9.3 7.562687 0.8798015 201       0
Hmisc::describe(movies23$imdb_stars)
movies23$imdb_stars 
       n  missing distinct     Info     Mean      Gmd      .05 
     201        0       39    0.998    7.563   0.9307      6.1 
     .10      .25      .50      .75      .90      .95 
     6.5      7.1      7.8      8.1      8.5      8.7 

lowest : 3.4 3.6 4.5 5.2 5.3, highest: 8.8 8.9 9   9.2 9.3

Highest rated movies?

movies23 |> select(film, imdb_stars) |> arrange(desc(imdb_stars)) |> head(11)
# A tibble: 11 × 2
   film                                           imdb_stars
   <chr>                                               <dbl>
 1 The Shawshank Redemption                              9.3
 2 The Godfather                                         9.2
 3 The Dark Knight                                       9  
 4 Lord of the Rings: The Return of the King             9  
 5 Pulp Fiction                                          8.9
 6 Inception                                             8.8
 7 Lord of the Rings: The Fellowship of the Ring         8.8
 8 Lord of the Rings: The Two Towers                     8.8
 9 Interstellar                                          8.7
10 The Matrix                                            8.7
11 Star Wars: Episode V - The Empire Strikes Back        8.7

Lowest rated movies?

movies23 |> select(film, imdb_stars) |> arrange(imdb_stars) |> head(9)
# A tibble: 9 × 2
  film                                           imdb_stars
  <chr>                                               <dbl>
1 The Gingerdead Man                                    3.4
2 The Room                                              3.6
3 Madea Goes To Jail                                    4.5
4 High School Musical 2                                 5.2
5 House Party 2                                         5.3
6 Monte Carlo                                           5.8
7 Murder Mystery                                        6  
8 Night at the Museum: Battle of the Smithsonian        6  
9 The Rite                                              6  

What can we do with imdb_categories?

What is in imdb_categories?

movies23 |> tabyl(imdb_categories)
              imdb_categories  n     percent
            Action, Adventure  2 0.009950249
    Action, Adventure, Comedy  3 0.014925373
     Action, Adventure, Drama  6 0.029850746
   Action, Adventure, Fantasy  7 0.034825871
   Action, Adventure, Mystery  1 0.004975124
    Action, Adventure, Sci-Fi  9 0.044776119
  Action, Adventure, Thriller  4 0.019900498
        Action, Comedy, Crime  1 0.004975124
      Action, Comedy, Fantasy  2 0.009950249
      Action, Comedy, Mystery  1 0.004975124
         Action, Crime, Drama  1 0.004975124
        Action, Crime, Sci-Fi  1 0.004975124
      Action, Crime, Thriller  2 0.009950249
                Action, Drama  2 0.009950249
       Action, Drama, Mystery  2 0.009950249
        Action, Drama, Sci-Fi  2 0.009950249
       Action, Drama, Western  1 0.004975124
       Action, Horror, Sci-Fi  1 0.004975124
               Action, Sci-Fi  1 0.004975124
     Adventure, Comedy, Crime  2 0.009950249
     Adventure, Comedy, Drama  2 0.009950249
    Adventure, Comedy, Family  3 0.014925373
   Adventure, Comedy, Fantasy  1 0.004975124
    Adventure, Comedy, Sci-Fi  2 0.009950249
     Adventure, Drama, Sci-Fi  1 0.004975124
        Adventure, Drama, War  1 0.004975124
   Adventure, Family, Fantasy  4 0.019900498
           Adventure, Fantasy  1 0.004975124
            Adventure, Sci-Fi  1 0.004975124
 Animation, Action, Adventure  2 0.009950249
 Animation, Adventure, Comedy  9 0.044776119
  Animation, Adventure, Drama  3 0.014925373
 Animation, Adventure, Family  3 0.014925373
    Animation, Drama, Fantasy  1 0.004975124
             Biography, Drama  2 0.009950249
    Biography, Drama, History  3 0.014925373
      Biography, Drama, Music  2 0.009950249
    Biography, Drama, Musical  1 0.004975124
      Biography, Drama, Sport  2 0.009950249
   Biography, Drama, Thriller  1 0.004975124
                       Comedy  5 0.024875622
                Comedy, Crime  1 0.004975124
         Comedy, Crime, Drama  1 0.004975124
         Comedy, Crime, Sport  1 0.004975124
                Comedy, Drama  8 0.039800995
        Comedy, Drama, Family  1 0.004975124
       Comedy, Drama, Fantasy  4 0.019900498
         Comedy, Drama, Music  4 0.019900498
       Comedy, Drama, Romance  9 0.044776119
               Comedy, Family  1 0.004975124
      Comedy, Family, Fantasy  3 0.014925373
              Comedy, Fantasy  1 0.004975124
      Comedy, Fantasy, Horror  1 0.004975124
                Comedy, Music  1 0.004975124
     Comedy, Musical, Romance  2 0.009950249
              Comedy, Romance  2 0.009950249
      Comedy, Romance, Sci-Fi  1 0.004975124
                 Crime, Drama  2 0.009950249
        Crime, Drama, Fantasy  1 0.004975124
        Crime, Drama, Mystery  2 0.009950249
       Crime, Drama, Thriller  6 0.029850746
                        Drama 10 0.049751244
                Drama, Family  1 0.004975124
       Drama, Family, Musical  1 0.004975124
       Drama, Family, Romance  1 0.004975124
       Drama, Fantasy, Horror  1 0.004975124
      Drama, Fantasy, Romance  1 0.004975124
       Drama, Horror, Mystery  2 0.009950249
                 Drama, Music  1 0.004975124
        Drama, Music, Musical  1 0.004975124
        Drama, Music, Romance  2 0.009950249
      Drama, Musical, Romance  1 0.004975124
               Drama, Mystery  1 0.004975124
     Drama, Mystery, Thriller  1 0.004975124
               Drama, Romance 11 0.054726368
       Drama, Romance, Sci-Fi  2 0.009950249
          Drama, Romance, War  2 0.009950249
      Drama, Sci-Fi, Thriller  2 0.009950249
              Drama, Thriller  2 0.009950249
                   Drama, War  1 0.004975124
    Horror, Mystery, Thriller  1 0.004975124
               Horror, Sci-Fi  1 0.004975124
     Horror, Sci-Fi, Thriller  1 0.004975124
            Mystery, Thriller  1 0.004975124

Is imdb_categories useful?

movies23 |> tabyl(imdb_categories) |> arrange(-n) |> adorn_pct_formatting()
              imdb_categories  n percent
               Drama, Romance 11    5.5%
                        Drama 10    5.0%
    Action, Adventure, Sci-Fi  9    4.5%
 Animation, Adventure, Comedy  9    4.5%
       Comedy, Drama, Romance  9    4.5%
                Comedy, Drama  8    4.0%
   Action, Adventure, Fantasy  7    3.5%
     Action, Adventure, Drama  6    3.0%
       Crime, Drama, Thriller  6    3.0%
                       Comedy  5    2.5%
  Action, Adventure, Thriller  4    2.0%
   Adventure, Family, Fantasy  4    2.0%
       Comedy, Drama, Fantasy  4    2.0%
         Comedy, Drama, Music  4    2.0%
    Action, Adventure, Comedy  3    1.5%
    Adventure, Comedy, Family  3    1.5%
  Animation, Adventure, Drama  3    1.5%
 Animation, Adventure, Family  3    1.5%
    Biography, Drama, History  3    1.5%
      Comedy, Family, Fantasy  3    1.5%
            Action, Adventure  2    1.0%
      Action, Comedy, Fantasy  2    1.0%
      Action, Crime, Thriller  2    1.0%
                Action, Drama  2    1.0%
       Action, Drama, Mystery  2    1.0%
        Action, Drama, Sci-Fi  2    1.0%
     Adventure, Comedy, Crime  2    1.0%
     Adventure, Comedy, Drama  2    1.0%
    Adventure, Comedy, Sci-Fi  2    1.0%
 Animation, Action, Adventure  2    1.0%
             Biography, Drama  2    1.0%
      Biography, Drama, Music  2    1.0%
      Biography, Drama, Sport  2    1.0%
     Comedy, Musical, Romance  2    1.0%
              Comedy, Romance  2    1.0%
                 Crime, Drama  2    1.0%
        Crime, Drama, Mystery  2    1.0%
       Drama, Horror, Mystery  2    1.0%
        Drama, Music, Romance  2    1.0%
       Drama, Romance, Sci-Fi  2    1.0%
          Drama, Romance, War  2    1.0%
      Drama, Sci-Fi, Thriller  2    1.0%
              Drama, Thriller  2    1.0%
   Action, Adventure, Mystery  1    0.5%
        Action, Comedy, Crime  1    0.5%
      Action, Comedy, Mystery  1    0.5%
         Action, Crime, Drama  1    0.5%
        Action, Crime, Sci-Fi  1    0.5%
       Action, Drama, Western  1    0.5%
       Action, Horror, Sci-Fi  1    0.5%
               Action, Sci-Fi  1    0.5%
   Adventure, Comedy, Fantasy  1    0.5%
     Adventure, Drama, Sci-Fi  1    0.5%
        Adventure, Drama, War  1    0.5%
           Adventure, Fantasy  1    0.5%
            Adventure, Sci-Fi  1    0.5%
    Animation, Drama, Fantasy  1    0.5%
    Biography, Drama, Musical  1    0.5%
   Biography, Drama, Thriller  1    0.5%
                Comedy, Crime  1    0.5%
         Comedy, Crime, Drama  1    0.5%
         Comedy, Crime, Sport  1    0.5%
        Comedy, Drama, Family  1    0.5%
               Comedy, Family  1    0.5%
              Comedy, Fantasy  1    0.5%
      Comedy, Fantasy, Horror  1    0.5%
                Comedy, Music  1    0.5%
      Comedy, Romance, Sci-Fi  1    0.5%
        Crime, Drama, Fantasy  1    0.5%
                Drama, Family  1    0.5%
       Drama, Family, Musical  1    0.5%
       Drama, Family, Romance  1    0.5%
       Drama, Fantasy, Horror  1    0.5%
      Drama, Fantasy, Romance  1    0.5%
                 Drama, Music  1    0.5%
        Drama, Music, Musical  1    0.5%
      Drama, Musical, Romance  1    0.5%
               Drama, Mystery  1    0.5%
     Drama, Mystery, Thriller  1    0.5%
                   Drama, War  1    0.5%
    Horror, Mystery, Thriller  1    0.5%
               Horror, Sci-Fi  1    0.5%
     Horror, Sci-Fi, Thriller  1    0.5%
            Mystery, Thriller  1    0.5%

Split into separate columns?

  • Each movie has up to three categories identified in imdb_categories.
  • There are 20 different categories represented across our 201 movies.
str_split_fixed(movies23$imdb_categories, ", ", n = 3) |> head()
     [,1]        [,2]      [,3]     
[1,] "Comedy"    "Drama"   ""       
[2,] "Drama"     ""        ""       
[3,] "Comedy"    "Drama"   "Romance"
[4,] "Adventure" "Sci-Fi"  ""       
[5,] "Drama"     "Mystery" ""       
[6,] "Comedy"    "Drama"   "Fantasy"
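
One way to check a claim like "20 different categories" is to split every string and count the unique pieces. A sketch in base R, using only the six category strings shown in the output above (so it finds far fewer than 20):

```r
cats <- c("Comedy, Drama", "Drama", "Comedy, Drama, Romance",
          "Adventure, Sci-Fi", "Drama, Mystery", "Comedy, Drama, Fantasy")

# strsplit() returns a list with one character vector per movie
pieces <- strsplit(cats, ", ", fixed = TRUE)

# Flatten the list, then count how often each genre appears
genre_counts <- sort(table(unlist(pieces)), decreasing = TRUE)
genre_counts
length(genre_counts)   # number of distinct genres in this small sample
```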

Can we create an indicator for Action?

We want:

  • a variable which is 1 if the movie’s imdb_categories list includes Action and 0 otherwise
  • and we’ll call it action.
movies23 <- movies23 |> 
  mutate(action = as.numeric(str_detect(imdb_categories, fixed("Action"))))
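
A self-contained sketch of the same idea on a toy tibble (four category strings taken from the table above). It also shows a caveat: fixed() matches substrings, so fixed("Music") is TRUE for "Musical" as well. A word-boundary regular expression such as "\\bMusic\\b" is one way to avoid that.

```r
library(dplyr)
library(stringr)

toy <- tibble(
  imdb_categories = c("Action, Adventure, Sci-Fi", "Comedy, Drama, Music",
                      "Drama, Family, Musical", "Drama")
)

toy <- toy |>
  mutate(
    action      = as.numeric(str_detect(imdb_categories, fixed("Action"))),
    # substring match: also flags "Musical"
    music_fixed = as.numeric(str_detect(imdb_categories, fixed("Music"))),
    # whole-word match: flags "Music" only
    music_word  = as.numeric(str_detect(imdb_categories, "\\bMusic\\b"))
  )
toy
```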

Check our coding?

movies23 |> select(film_id, film, imdb_categories, action) |> slice(128:137)
# A tibble: 10 × 4
   film_id film                                 imdb_categories action
   <chr>   <chr>                                <chr>            <dbl>
 1 128     Mission Impossible: Ghost Protocol   Action, Advent…      1
 2 129     Moneyball                            Biography, Dra…      0
 3 130     Monte Carlo                          Adventure, Com…      0
 4 131     Moonlight                            Drama                0
 5 132     Murder Mystery                       Action, Comedy…      1
 6 133     My Big Fat Greek Wedding             Comedy, Drama,…      0
 7 134     My Fair Lady                         Drama, Family,…      0
 8 135     Mystery Men                          Action, Comedy…      1
 9 136     National Lampoon's Christmas Vacati… Comedy               0
10 137     Night at the Museum: Battle of the … Adventure, Com…      0

How many “Action” movies?

movies23 |> tabyl(action) 
 action   n   percent
      0 150 0.7462687
      1  51 0.2537313

Actually, those are proportions, not percentages.

movies23 |> tabyl(action) |> adorn_pct_formatting()
 action   n percent
      0 150   74.6%
      1  51   25.4%

OK. We need to do this for all 20 genres specified in imdb_categories.

Indicators of All 20 Genres

movies23 <- movies23 |> 
  mutate(action = as.numeric(str_detect(imdb_categories, fixed("Action"))),
         adventure = as.numeric(str_detect(imdb_categories, fixed("Adventure"))),
         animation = as.numeric(str_detect(imdb_categories, fixed("Animation"))),
         biography = as.numeric(str_detect(imdb_categories, fixed("Biography"))),
         comedy = as.numeric(str_detect(imdb_categories, fixed("Comedy"))),
         crime = as.numeric(str_detect(imdb_categories, fixed("Crime"))),
         drama = as.numeric(str_detect(imdb_categories, fixed("Drama"))),
         family = as.numeric(str_detect(imdb_categories, fixed("Family"))),
         fantasy = as.numeric(str_detect(imdb_categories, fixed("Fantasy"))),
         history = as.numeric(str_detect(imdb_categories, fixed("History"))),
         horror = as.numeric(str_detect(imdb_categories, fixed("Horror"))),
         # word-boundary regex, so that "Musical" is not also counted as Music
         music = as.numeric(str_detect(imdb_categories, "\\bMusic\\b")),
         musical = as.numeric(str_detect(imdb_categories, fixed("Musical"))),
         mystery = as.numeric(str_detect(imdb_categories, fixed("Mystery"))),
         romance = as.numeric(str_detect(imdb_categories, fixed("Romance"))),
         scifi = as.numeric(str_detect(imdb_categories, fixed("Sci-Fi"))),
         sport = as.numeric(str_detect(imdb_categories, fixed("Sport"))),
         thriller = as.numeric(str_detect(imdb_categories, fixed("Thriller"))),
         war = as.numeric(str_detect(imdb_categories, fixed("War"))),
         western = as.numeric(str_detect(imdb_categories, fixed("Western")))
  )

Summing Up Genres, Horizontally

movies23 |> 
  summarise(across(.cols = action:western, .fns = sum))
# A tibble: 1 × 20
  action adventure animation biography comedy crime drama family
   <dbl>     <dbl>     <dbl>     <dbl>  <dbl> <dbl> <dbl>  <dbl>
1     51        67        18        11     72    21   115     18
# ℹ 12 more variables: fantasy <dbl>, history <dbl>, horror <dbl>,
#   music <dbl>, musical <dbl>, mystery <dbl>, romance <dbl>,
#   scifi <dbl>, sport <dbl>, thriller <dbl>, war <dbl>,
#   western <dbl>

Sorted Counts of Movies by Genre

movies23 |> 
  summarise(across(.cols = action:western, .fns = sum)) |>
  t() |> as.data.frame() |> rename(count = V1) |> arrange(-count) 
          count
drama       115
comedy       72
adventure    67
action       51
romance      34
fantasy      28
scifi        25
crime        21
thriller     21
animation    18
family       18
mystery      12
biography    11
music        11
horror        8
musical       6
war           4
history       3
sport         3
western       1
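
The t() |> as.data.frame() step works, but pivot_longer() is a more tidyverse-flavored route to the same kind of sorted table, and it keeps the genre names in a regular column. A sketch on a toy tibble with three made-up indicator columns:

```r
library(dplyr)
library(tidyr)

# Hypothetical 0/1 indicators for four movies
toy <- tibble(
  action = c(1, 0, 0, 1),
  comedy = c(0, 1, 1, 1),
  drama  = c(0, 1, 0, 0)
)

genre_counts <- toy |>
  summarise(across(.cols = action:drama, .fns = sum)) |>
  pivot_longer(cols = everything(),
               names_to = "genre", values_to = "count") |>
  arrange(desc(count))
genre_counts
```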

Question 1: Are movies getting longer?

Movie Lengths, over Time (ver. 1)

Plot the association of year and length

ggplot(movies23, aes(x = year, y = length)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x, col = "red") +
  geom_smooth(method = "loess", se = FALSE, formula = y ~ x, col = "blue") +
  labs(x = "Year of Release", y = "Length (in minutes)",
       title = "Favorite Movies: Length and Year of Release")

Add the correlation in a subtitle

ggplot(movies23, aes(x = year, y = length)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x, col = "red") +
  geom_smooth(method = "loess", se = FALSE, formula = y ~ x, col = "blue") +
  labs(x = "Year of Release", y = "Length (in minutes)",
       title = "Favorite Movies: Length and Year of Release",
       subtitle = str_glue("Pearson Correlation = ", round_half_up(
         cor(movies23$year, movies23$length),3)))

Use film_id labels instead of points

ggplot(movies23, aes(x = year, y = length, label = film_id)) +
  geom_label() +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x, col = "red") +
  geom_smooth(method = "loess", se = FALSE, formula = y ~ x, col = "blue") +
  labs(x = "Year of Release", y = "Length (in minutes)",
       title = "Favorite Movies: Length and Year of Release",
       subtitle = str_glue("Pearson Correlation = ", round_half_up(
         cor(movies23$year, movies23$length),3)))

Use text to show film names

ggplot(movies23, aes(x = year, y = length, label = film)) +
  geom_point(col = "coral") +
  geom_text() +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x, col = "red") +
  geom_smooth(method = "loess", se = FALSE, formula = y ~ x, col = "blue") +
  labs(x = "Year of Release", y = "Length (in minutes)",
       title = "Favorite Movies: Length and Year of Release",
       subtitle = str_glue("Pearson Correlation = ", round_half_up(
         cor(movies23$year, movies23$length),3)))

Show film text for selected movies

ggplot(movies23, aes(x = year, y = length, label = film)) +
  geom_point(col = "coral") +
  geom_text(data = movies23 |> filter(year < 1975 | length > 180)) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x, col = "red") +
  geom_smooth(method = "loess", se = FALSE, formula = y ~ x, col = "blue") +
  labs(x = "Year of Release", y = "Length (in minutes)",
       title = "Favorite Movies: Length and Year of Release",
       subtitle = str_glue("Pearson Correlation = ", round_half_up(
         cor(movies23$year, movies23$length),3)))

Try geom_text_repel()

ggplot(movies23, aes(x = year, y = length, label = film)) +
  geom_point(col = "coral") +
  geom_text_repel(data = movies23 |> filter(year < 1975 | length > 180)) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x, col = "red") +
  geom_smooth(method = "loess", se = FALSE, formula = y ~ x, col = "blue") +
  labs(x = "Year of Release", y = "Length (in minutes)",
       title = "Favorite Movies: Length and Year of Release",
       subtitle = str_glue("Pearson Correlation = ", round_half_up(
         cor(movies23$year, movies23$length),3)))

geom_label_repel and colors?

ggplot(movies23, aes(x = year, y = length, label = film)) +
  geom_point(col = "coral") +
  geom_point(data = movies23 |> filter(year < 1975 | length > 180), 
             color = "darkgreen") +
  geom_label_repel(data = movies23 |> filter(year < 1975 | length > 180), 
                  color = "darkgreen") +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x, col = "red") +
  geom_smooth(method = "loess", se = FALSE, formula = y ~ x, col = "blue") +
  labs(x = "Year of Release", y = "Length (in minutes)",
       title = "Favorite Movies: Length and Year of Release",
       subtitle = str_glue("Pearson Correlation = ", round_half_up(
         cor(movies23$year, movies23$length),3)))

Model for Length, using Year?

m1 <- lm(length ~ year, data = movies23)
tidy(m1, conf.int = TRUE, conf.level = 0.90) |> gt()
term             estimate    std.error   statistic    p.value      conf.low    conf.high
(Intercept)  289.92103409  251.6727534   1.1519762  0.2507129  -125.9799556  705.8220238
year          -0.08292402    0.1256459  -0.6599818  0.5100287    -0.2905598    0.1247117
glance(m1) |> gt()
  r.squared  adj.r.squared     sigma  statistic    p.value  df     logLik       AIC       BIC  deviance  df.residual  nobs
0.002184043   -0.002830107  25.57872  0.4355759  0.5100287   1  -935.7956  1877.591  1887.501  130199.9          199   201
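
The 90% interval for the year slope can be rebuilt from the estimate and standard error reported by tidy(), and the Pearson correlation can be recovered from the r.squared reported by glance(). A sketch using those printed values (it does not refit the model):

```r
# Values copied from the tidy() and glance() output above
est       <- -0.08292402
se        <- 0.1256459
df_resid  <- 199
r_squared <- 0.002184043

# 90% CI: estimate +/- t quantile * SE, with 5% in each tail
t_crit <- qt(0.95, df = df_resid)
ci <- est + c(-1, 1) * t_crit * se
round(ci, 4)

# In simple regression, r takes the sign of the slope and r^2 equals R-squared
r <- sign(est) * sqrt(r_squared)
round(r, 3)
```

The interval comfortably includes zero, so these data do not show a clear linear trend in length over time.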

Year and Length for Action/non-Action

ggplot(movies23, aes(x = year, y = length, col = factor(action))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x, col = "red") +
  facet_wrap(~ action, labeller = "label_both") +
  guides(col = "none") +
  scale_color_manual(values = c("plum", "steelblue")) +
  labs(x = "Year of Release", y = "Length (in minutes)",
       title = "Favorite Movies: Length and Year of Release",
       subtitle = str_glue("Comparing Action movies (n = ", 
                       sum(movies23$action), ") to All Others (n = ", 
                       nrow(movies23) - sum(movies23$action), ")"))

Year and Length for Adventure or Not?

Interaction of Centered Year & Adventure

movies23 <- movies23 |> mutate(year_c = year - mean(year))

m2 <- lm(length ~ year_c * adventure, data = movies23)
m2

Call:
lm(formula = length ~ year_c * adventure, data = movies23)

Coefficients:
     (Intercept)            year_c         adventure  
        122.2920           -0.2062            3.7769  
year_c:adventure  
          0.4239  
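
One reason to center year before fitting is that the intercept then describes a movie at the mean release year instead of at year 0. The sketch below shows this on simulated stand-in data (not the movies23 sheet), using a one-predictor model for simplicity; the data frame `toy` and its values are assumptions made for illustration.

```r
set.seed(431)

# Simulated stand-in data, not the movies23 sheet
toy <- data.frame(year = sample(1950:2022, 60, replace = TRUE))
toy$length <- 120 - 0.1 * (toy$year - 2000) + rnorm(60, sd = 20)
toy$year_c <- toy$year - mean(toy$year)

fit_raw <- lm(length ~ year, data = toy)
fit_ctr <- lm(length ~ year_c, data = toy)

# The slope is unchanged by centering
slope_raw <- unname(coef(fit_raw)["year"])
slope_ctr <- unname(coef(fit_ctr)["year_c"])

# The centered intercept is the fitted value at the mean year
int_ctr <- unname(coef(fit_ctr)["(Intercept)"])
pred_at_mean <- unname(predict(fit_raw,
                               newdata = data.frame(year = mean(toy$year))))

c(int_ctr = int_ctr, pred_at_mean = pred_at_mean,
  mean_length = mean(toy$length))
```

In m2, which also includes adventure and its interaction with year_c, the intercept is the predicted length at the mean year for movies with adventure = 0.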

Coefficients and Summaries

tidy(m2, conf.int = TRUE, conf.level = 0.90)
# A tibble: 4 × 7
  term       estimate std.error statistic   p.value conf.low conf.high
  <chr>         <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercep…  122.        2.21     55.4   3.98e-122 119.      126.    
2 year_c       -0.206     0.146    -1.41  1.60e-  1  -0.448     0.0353
3 adventure     3.78      3.85      0.982 3.27e-  1  -2.58     10.1   
4 year_c:ad…    0.424     0.287     1.48  1.41e-  1  -0.0506    0.898 
glance(m2) |> select(r.squared, sigma, AIC, nobs, df, df.residual)
# A tibble: 1 × 6
  r.squared sigma   AIC  nobs    df df.residual
      <dbl> <dbl> <dbl> <int> <dbl>       <int>
1    0.0193  25.5 1878.   201     3         197
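
The interaction term is easiest to read as a pair of slopes, one per group. The sketch below works that out from the printed m2 coefficients, assuming adventure is a 0/1 indicator (as comedy and drama are elsewhere in these slides); the numbers are copied from the output, not re-estimated.

```r
# Coefficients copied from the printed m2 output
b_int      <- 122.2920   # intercept: adventure = 0, at the mean year
b_year     <-  -0.2062   # slope of year_c when adventure = 0
b_adv      <-   3.7769   # shift at the mean year when adventure = 1
b_interact <-   0.4239   # additional slope when adventure = 1

# Slope of length on year within each group (minutes per year)
slope_non_adv <- b_year
slope_adv     <- b_year + b_interact

# Predicted length at the mean release year within each group
pred_non_adv_mean_year <- b_int
pred_adv_mean_year     <- b_int + b_adv

c(slope_non_adv = slope_non_adv, slope_adv = slope_adv,
  pred_non_adv = pred_non_adv_mean_year, pred_adv = pred_adv_mean_year)
```

The point estimates suggest slopes of opposite sign in the two groups, but the 90% confidence intervals for year_c, adventure, and their interaction all include zero.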

Tweak the Question?

Are movies made prior to 2000 longer or shorter than movies made in 2000 or later?

movies23 <- movies23 |>
  mutate(before2000 = factor(ifelse(year < 2000, "Early", "Late")))

ggplot(movies23, aes(x = before2000, y = length)) +
  geom_violin() +
  geom_boxplot(aes(fill = before2000), notch = TRUE, 
               width = 0.3, outlier.size = 3) +
  stat_summary(fun = "mean", geom = "point", 
               shape = 23, size = 3, fill = "white") +
  scale_fill_viridis_d(alpha = 0.5) +
  guides(fill = "none") +
  coord_flip() +
  labs(x = "", y = "Length (in minutes)")

Meaningful difference in means?

favstats(length ~ before2000, data = movies23) |> gt()
before2000  min  Q1   median  Q3      max  mean      sd        n    missing
Early       83   102  117     133.25  207  123.3788  28.01470  66   0
Late        70   105  118     139.50  201  124.0444  24.35002  135  0
m3 <- lm(length ~ before2000, data = movies23)
tidy(m3, conf.int = TRUE, conf.level = 0.90) |> gt() |> fmt_number(decimals = 3)
term            estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)     123.379   3.152      39.146     0.000    118.170   128.587
before2000Late  0.666     3.846      0.173      0.863    −5.690    7.021
glance(m3) |> 
  round_half_up(digits = c(4, 4, 2, 2, 3, 0, 0, 1, 1, 0, 0, 0)) |> gt()
r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik  AIC   BIC     deviance  df.residual  nobs
2e-04      -0.0049        25.6   0.03       0.863    1   -936    1878  1887.9  130465    199          201

Compare Means with Bootstrap?

t.test(length ~ before2000, data = movies23, var.equal = TRUE,
       conf.level = 0.90) |>
  tidy() |> gt() |> fmt_number(decimals = 3)
estimate  estimate1  estimate2  statistic  p.value  parameter  conf.low  conf.high  method             alternative
−0.666    123.379    124.044    −0.173     0.863    199.000    −7.021    5.690      Two Sample t-test  two.sided
## need Love-boost.R for bootdif() function

source("https://raw.githubusercontent.com/THOMASELOVE/431-data/main/data-and-code/Love-boost.R")

set.seed(20230928)
bootdif(y = movies23$length, g = movies23$before2000, 
        conf.level = 0.90, B.reps = 1000) 
Mean Difference            0.05            0.95 
      0.6656566      -6.0256061       7.3145286 
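
For readers curious about what a bootstrap interval for a difference in means involves, here is a minimal percentile-bootstrap sketch. It is not the source of bootdif() from Love-boost.R, whose internals are not shown in these slides; the function name `boot_mean_diff` and the simulated data are assumptions for illustration. The sign convention (second level minus first) follows what the bootdif() output above appears to use.

```r
# Percentile bootstrap for a difference in two group means.
# Illustrative sketch only; not the bootdif() implementation.
boot_mean_diff <- function(y, g, conf.level = 0.90, reps = 1000) {
  g <- factor(g)
  stopifnot(nlevels(g) == 2, length(y) == length(g))
  y1 <- y[g == levels(g)[1]]
  y2 <- y[g == levels(g)[2]]
  # Resample each group with replacement; record mean(group 2) - mean(group 1)
  diffs <- replicate(
    reps,
    mean(sample(y2, replace = TRUE)) - mean(sample(y1, replace = TRUE))
  )
  alpha <- 1 - conf.level
  ci <- quantile(diffs, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
  c(estimate = mean(y2) - mean(y1), lower = ci[1], upper = ci[2])
}

# Simulated stand-in data with group sizes like those in the slides
set.seed(431)
toy_y <- c(rnorm(66, mean = 123, sd = 28), rnorm(135, mean = 124, sd = 24))
toy_g <- rep(c("Early", "Late"), times = c(66, 135))

res <- boot_mean_diff(toy_y, toy_g, conf.level = 0.90, reps = 1000)
round(res, 2)
```

Because the resampling is random, the interval endpoints change with the seed and the number of replications, which is why the slides call set.seed() before bootdif().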

Question 2: Which categories get higher ratings?

Do Dramas have higher ratings than Comedies?

movies23 |> tabyl(comedy, drama) |> adorn_title()
        drama   
 comedy     0  1
      0    43 86
      1    43 29
  • 29 movies are both Comedy and Drama, and 43 are neither. What should we do about this?
  • Exclude the Movies that are both, or neither (Approach A)
  • Include all of the Movies, making 4 categories (Approach B)

Approach A

Do Dramas have higher ratings (more imdb_stars) than Comedies?

  • excluding the Movies that are both, or neither…
mov_dc1 <- movies23 |>
  filter(comedy + drama == 1)

mov_dc1 |> tabyl(comedy, drama) |> adorn_title()
        drama   
 comedy     0  1
      0     0 86
      1    43  0

Approach A (continued)

mov_dc1 <- mov_dc1 |> 
  mutate(genre = fct_recode(factor(comedy), "Comedy" = "1", "Drama" = "0"))

mov_dc1 |> count(genre, comedy, drama)
# A tibble: 2 × 4
  genre  comedy drama     n
  <fct>   <dbl> <dbl> <int>
1 Drama       0     1    86
2 Comedy      1     0    43

Approach A (Stars by Genre)

ggplot(data = mov_dc1, aes(x = imdb_stars, y = genre, 
                          fill = genre, height = after_stat(density))) +
  geom_density_ridges(scale = 0.8) +
  scale_fill_viridis_d(option = "A") + theme_ridges()

Approach A (Stars by Genre)

favstats(imdb_stars ~ genre, data = mov_dc1) |> gt()
genre   min  Q1     median  Q3   max  mean      sd         n   missing
Drama   3.6  7.525  7.9     8.3  9.3  7.859302  0.7711531  86  0
Comedy  3.4  6.500  7.1     7.7  8.5  7.018605  0.8872794  43  0
m4 <- lm(imdb_stars ~ genre, data = mov_dc1)

tidy(m4, conf.int = TRUE, conf.level = 0.9) |> gt()
term         estimate    std.error   statistic  p.value        conf.low   conf.high
(Intercept)  7.8593023   0.08749535  89.825372  8.544540e-117  7.714328   8.0042769
genreComedy  -0.8406977  0.15154639  -5.547461  1.612117e-07   -1.091801  -0.5895943

T test and Bootstrap 90% CIs?

favstats(imdb_stars ~ genre, data = mov_dc1) |> gt()
genre   min  Q1     median  Q3   max  mean      sd         n   missing
Drama   3.6  7.525  7.9     8.3  9.3  7.859302  0.7711531  86  0
Comedy  3.4  6.500  7.1     7.7  8.5  7.018605  0.8872794  43  0
t.test(imdb_stars ~ genre, data = mov_dc1,
       var.equal = TRUE, conf.level = 0.90) |>
  tidy(conf.int = TRUE) |> gt()
estimate   estimate1  estimate2  statistic  p.value       parameter  conf.low   conf.high  method             alternative
0.8406977  7.859302   7.018605   5.547461   1.612117e-07  127        0.5895943  1.091801   Two Sample t-test  two.sided
set.seed(4322023)
bootdif(y = mov_dc1$imdb_stars, g = mov_dc1$genre, 
        conf.level = 0.90, B.reps = 2000)
Mean Difference            0.05            0.95 
     -0.8406977      -1.1197674      -0.5824419 
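
The pooled (equal-variance) t-test result can be checked by hand from the favstats() summary. The sketch below uses only the means, standard deviations, and group sizes printed above, so small rounding differences from the t.test() output are expected.

```r
# Summary statistics copied from the favstats() output
m_drama  <- 7.859302; s_drama  <- 0.7711531; n_drama  <- 86
m_comedy <- 7.018605; s_comedy <- 0.8872794; n_comedy <- 43

# Pooled standard deviation and its degrees of freedom
dof <- n_drama + n_comedy - 2
s_pooled <- sqrt(((n_drama - 1) * s_drama^2 +
                  (n_comedy - 1) * s_comedy^2) / dof)

# Standard error of the difference in means (Drama - Comedy)
se_diff <- s_pooled * sqrt(1 / n_drama + 1 / n_comedy)

est    <- m_drama - m_comedy
t_stat <- est / se_diff

# 90% confidence interval uses the 95th percentile of t with dof df
ci_90 <- est + c(-1, 1) * qt(0.95, dof) * se_diff

round(c(estimate = est, se = se_diff, t = t_stat,
        lower = ci_90[1], upper = ci_90[2]), 4)
```

The standard error here matches the std.error of genreComedy in tidy(m4), since a two-group regression and the pooled t-test are the same comparison with the sign reversed.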

Approach B

Do Dramas have higher ratings (more imdb_stars) than Comedies?

  • including all of the Movies, creating four categories
mov_dc2 <- movies23 |> 
  mutate(genre4 = fct_recode(factor(10*comedy + drama),
                             "Comedy only" = "10",
                             "Drama only" = "1",
                             "Both" = "11",
                             "Neither" = "0"))

Check that We Recoded Correctly

mov_dc2 |> count(comedy, drama, genre4)
# A tibble: 4 × 4
  comedy drama genre4          n
   <dbl> <dbl> <fct>       <int>
1      0     0 Neither        43
2      0     1 Drama only     86
3      1     0 Comedy only    43
4      1     1 Both           29
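
The recoding works because 10*comedy + drama gives each of the four combinations a distinct code: 0, 1, 10, or 11. The sketch below shows the same idea on a four-row toy data frame using base R's factor() with explicit levels and labels. This is an alternative to the fct_recode() call in the slides, not a replacement for it, and the data frame `toy` is made up for illustration.

```r
# One row per comedy/drama combination (toy data for illustration)
toy <- data.frame(comedy = c(0, 0, 1, 1),
                  drama  = c(0, 1, 0, 1))

# Each combination maps to a unique numeric code
code <- 10 * toy$comedy + toy$drama   # 0, 1, 10, 11

# Attach readable labels to the codes, in numeric order
toy$genre4 <- factor(code,
                     levels = c(0, 1, 10, 11),
                     labels = c("Neither", "Drama only",
                                "Comedy only", "Both"))
toy
```

Listing the levels explicitly also fixes their order, so "Neither" becomes the reference category in a model like lm(imdb_stars ~ genre4).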

Approach B (Stars by Genre)

ggplot(data = mov_dc2, aes(x = imdb_stars, y = genre4, 
                          fill = genre4, height = after_stat(density))) +
  geom_density_ridges(scale = 0.8) +
  scale_fill_viridis_d(option = "A") + theme_ridges()

Approach B (Stars by Genre)

favstats(imdb_stars ~ genre4, data = mov_dc2) |> gt()
genre4       min  Q1     median  Q3    max  mean      sd         n   missing
Neither      6.1  7.300  7.8     8.35  8.8  7.702326  0.7265340  43  0
Drama only   3.6  7.525  7.9     8.30  9.3  7.859302  0.7711531  86  0
Comedy only  3.4  6.500  7.1     7.70  8.5  7.018605  0.8872794  43  0
Both         4.5  6.900  7.6     7.90  8.6  7.282759  0.9565821  29  0
m5 <- lm(imdb_stars ~ genre4, data = mov_dc2)
tidy(m5, conf.int = TRUE, conf.level = 0.9) |> gt()
term               estimate    std.error  statistic  p.value        conf.low     conf.high
(Intercept)        7.7023256   0.1245480  61.842241  5.470725e-131  7.49649446   7.90815670
genre4Drama only   0.1569767   0.1525395  1.029089   3.046996e-01   -0.09511386  0.40906735
genre4Comedy only  -0.6837209  0.1761374  -3.881747  1.414359e-04   -0.97481009  -0.39263177
genre4Both         -0.4195670  0.1962474  -2.137949  3.375296e-02   -0.74389036  -0.09524356
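
Each m5 coefficient is a difference from the reference group ("Neither"), so the four group means can be rebuilt from the estimates. The sketch below does that arithmetic with numbers copied from the tidy(m5) output and compares them with the favstats() means.

```r
# Estimates copied from the tidy(m5) output
b <- c(intercept   =  7.7023256,
       drama_only  =  0.1569767,
       comedy_only = -0.6837209,
       both        = -0.4195670)

# Group mean = intercept (reference group) + that group's coefficient
group_means <- c(
  "Neither"     = b[["intercept"]],
  "Drama only"  = b[["intercept"]] + b[["drama_only"]],
  "Comedy only" = b[["intercept"]] + b[["comedy_only"]],
  "Both"        = b[["intercept"]] + b[["both"]]
)
round(group_means, 4)

# Drama only vs. Comedy only is the gap between their two coefficients
drama_minus_comedy <- b[["drama_only"]] - b[["comedy_only"]]
round(drama_minus_comedy, 4)
```

The Drama only minus Comedy only gap of about 0.84 stars agrees with the difference estimated under Approach A.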

Question 4: Which categories get the most ratings?

Ratings by Category

  • Comparing Drama to Comedy again
ggplot(data = mov_dc2, aes(x = imdb_ratings/1000, y = genre4)) +
  geom_violin(aes(fill = genre4)) +
  geom_boxplot(width = 0.3, notch = TRUE, outlier.size = 2) +
  stat_summary(fun = "mean", geom = "point", 
               shape = 21, size = 2, fill = "purple") +
  scale_fill_brewer(palette = "Accent") +
  guides(fill = "none") +
  labs(x = "IMDB ratings (in 1000s)", y = "Genre",
       title = "Boxplot with Violin for 201 Movies")

Ridgeline Plot: IMDB Ratings

ggplot(data = mov_dc2, aes(x = imdb_ratings/1000, y = genre4, 
                          fill = genre4, height = after_stat(density))) +
  geom_density_ridges(scale = 0.8) +
  scale_fill_viridis_d(option = "D") + theme_ridges() +
  guides(fill = "none") + 
  labs(x = "IMDB Ratings (in thousands)")

A Few More Scatterplots

Some Other Questions

  • Q3: Are longer movies rated more highly? (length, imdb_stars)
  • Q5: Do more recent movies get more ratings? (year, imdb_ratings)
  • Q6: Do more recent movies have higher ratings? (year, imdb_stars)
  • Q7: Are ratings and stars associated? (imdb_ratings, imdb_stars)

Q3: Length vs. Average Rating?

ggplot(movies23, aes(x = length, y = imdb_stars)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, formula = y ~ x, col = "red") +
  geom_smooth(method = "loess", se = FALSE, formula = y ~ x, col = "blue") +
  labs(x = "Length (in minutes)", y = "IMDB Stars (0-10 scale)",
       title = "Favorite Movies: Length and IMDB Stars",
       subtitle = str_glue("Pearson Correlation = ", round_half_up(
         cor(movies23$length, movies23$imdb_stars), 3)))

Q5: Year vs. # of Star Ratings?

ggplot(movies23, aes(x = year, y = imdb_ratings/1000)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, formula = y ~ x, col = "red") +
  geom_smooth(method = "loess", se = FALSE, formula = y ~ x, col = "blue") +
  labs(x = "Year of Release", y = "IMDB Ratings (thousands)",
       title = "Favorite Movies: IMDB Ratings and Year of Release",
       subtitle = str_glue("Pearson Correlation = ", round_half_up(
         cor(movies23$year, movies23$imdb_ratings), 3)))

Q6: Year vs. Number of Stars?

ggplot(movies23, aes(x = year, y = imdb_stars)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, formula = y ~ x, col = "red") +
  geom_smooth(method = "loess", se = FALSE, formula = y ~ x, col = "blue") +
  labs(x = "Year of Release", y = "IMDB Stars",
       title = "Favorite Movies: IMDB Stars and Year of Release",
       subtitle = str_glue("Pearson Correlation = ", round_half_up(
         cor(movies23$year, movies23$imdb_stars), 3)))

Q7: Ratings vs. IMDB Stars?

ggplot(movies23, aes(x = imdb_ratings/1000, y = imdb_stars)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, formula = y ~ x, col = "red") +
  geom_smooth(method = "loess", se = FALSE, formula = y ~ x, col = "blue") +
  labs(x = "IMDB Ratings (in thousands)", y = "IMDB Stars",
       title = "Favorite Movies: Number of Ratings and Stars",
       subtitle = str_glue("Pearson Correlation = ", round_half_up(
         cor(movies23$imdb_ratings, movies23$imdb_stars), 3)))

Session Information with xfun

Either the xfun or sessioninfo version of session_info() can be used.

xfun::session_info()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)


Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

Package version:
  askpass_1.2.0       backports_1.4.1     base64enc_0.1-3    
  bigD_0.2.0          bit_4.0.5           bit64_4.0.5        
  bitops_1.0.7        blob_1.2.4          broom_1.0.5        
  bslib_0.5.1         cachem_1.0.8        callr_3.7.3        
  cellranger_1.1.0    checkmate_2.2.0     cli_3.6.1          
  clipr_0.8.0         cluster_2.1.4       colorspace_2.1-0   
  commonmark_1.9.0    compiler_4.3.1      conflicted_1.2.0   
  cpp11_0.4.6         crayon_1.5.2        curl_5.0.2         
  data.table_1.14.8   DBI_1.1.3           dbplyr_2.3.4       
  digest_0.6.33       dplyr_1.1.2         dtplyr_1.3.1       
  ellipsis_0.3.2      evaluate_0.21       fansi_1.0.4        
  farver_2.1.1        fastmap_1.1.1       fontawesome_0.5.2  
  forcats_1.0.0       foreign_0.8-84      Formula_1.2-5      
  fs_1.6.3            gargle_1.5.2        generics_0.1.3     
  ggforce_0.4.1       ggformula_0.10.4    ggplot2_3.4.3      
  ggrepel_0.9.3       ggridges_0.5.4      ggstance_0.3.6     
  glue_1.6.2          googledrive_2.1.1   googlesheets4_1.1.1
  graphics_4.3.1      grDevices_4.3.1     grid_4.3.1         
  gridExtra_2.3       gt_0.9.0            gtable_0.3.4       
  haven_2.5.3         highr_0.10          Hmisc_5.1-0        
  hms_1.1.3           htmlTable_2.4.1     htmltools_0.5.6    
  htmlwidgets_1.6.2   httr_1.4.7          ids_1.0.1          
  isoband_0.2.7       janitor_2.2.0       jquerylib_0.1.4    
  jsonlite_1.8.7      juicyjuice_0.1.0    knitr_1.44         
  labeling_0.4.3      labelled_2.12.0     lattice_0.21-8     
  lifecycle_1.0.3     lubridate_1.9.2     magrittr_2.0.3     
  markdown_1.8        MASS_7.3-60         Matrix_1.6-1       
  memoise_2.0.1       methods_4.3.1       mgcv_1.8-42        
  mime_0.12           modelr_0.1.11       mosaic_1.8.4.2     
  mosaicCore_0.9.2.1  mosaicData_0.20.3   munsell_0.5.0      
  naniar_1.0.0        nlme_3.1-162        nnet_7.3-19        
  norm_1.0.11.1       openssl_2.1.1       patchwork_1.1.3    
  pillar_1.9.0        pkgconfig_2.0.3     plyr_1.8.8         
  polyclip_1.10-6     prettyunits_1.2.0   processx_3.8.2     
  progress_1.2.2      ps_1.7.5            purrr_1.0.2        
  R6_2.5.1            ragg_1.2.5          rappdirs_0.3.3     
  RColorBrewer_1.1-3  Rcpp_1.0.11         RcppEigen_0.3.3.9.3
  reactable_0.4.4     reactR_0.4.4        readr_2.1.4        
  readxl_1.4.3        rematch_2.0.0       rematch2_2.1.2     
  reprex_2.0.2        rlang_1.1.1         rmarkdown_2.25     
  rpart_4.1.19        rstudioapi_0.15.0   rvest_1.0.3        
  sass_0.4.7          scales_1.2.1        selectr_0.4.2      
  snakecase_0.11.1    splines_4.3.1       stats_4.3.1        
  stringi_1.7.12      stringr_1.5.0       sys_3.4.2          
  systemfonts_1.0.4   textshaping_0.3.6   tibble_3.2.1       
  tidyr_1.3.0         tidyselect_1.2.0    tidyverse_2.0.0    
  timechange_0.2.0    tinytex_0.46        tools_4.3.1        
  tweenr_2.0.2        tzdb_0.4.0          UpSetR_1.4.0       
  utf8_1.2.3          utils_4.3.1         uuid_1.1.1         
  V8_4.3.3            vctrs_0.6.3         viridis_0.6.4      
  viridisLite_0.4.2   visdat_0.6.0        vroom_1.6.3        
  withr_2.5.1         xfun_0.40           xml2_1.3.5         
  yaml_2.3.7         

Using sessioninfo

sessioninfo::session_info()
─ Session info ─────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16 ucrt)
 os       Windows 11 x64 (build 22621)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_United States.utf8
 ctype    English_United States.utf8
 tz       America/New_York
 date     2023-09-28
 pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ─────────────────────────────────────────────────────────
 package       * version date (UTC) lib source
 backports       1.4.1   2021-12-13 [1] CRAN (R 4.3.0)
 base64enc       0.1-3   2015-07-28 [1] CRAN (R 4.3.0)
 broom         * 1.0.5   2023-06-09 [1] CRAN (R 4.3.1)
 cellranger      1.1.0   2016-07-27 [1] CRAN (R 4.3.1)
 checkmate       2.2.0   2023-04-27 [1] CRAN (R 4.3.1)
 cli             3.6.1   2023-03-23 [1] CRAN (R 4.3.1)
 cluster         2.1.4   2022-08-22 [2] CRAN (R 4.3.1)
 colorspace      2.1-0   2023-01-23 [1] CRAN (R 4.3.1)
 curl            5.0.2   2023-08-14 [1] CRAN (R 4.3.1)
 data.table      1.14.8  2023-02-17 [1] CRAN (R 4.3.1)
 digest          0.6.33  2023-07-07 [1] CRAN (R 4.3.1)
 dplyr         * 1.1.2   2023-04-20 [1] CRAN (R 4.3.1)
 ellipsis        0.3.2   2021-04-29 [1] CRAN (R 4.3.1)
 evaluate        0.21    2023-05-05 [1] CRAN (R 4.3.1)
 fansi           1.0.4   2023-01-22 [1] CRAN (R 4.3.1)
 farver          2.1.1   2022-07-06 [1] CRAN (R 4.3.1)
 fastmap         1.1.1   2023-02-24 [1] CRAN (R 4.3.1)
 forcats       * 1.0.0   2023-01-29 [1] CRAN (R 4.3.1)
 foreign         0.8-84  2022-12-06 [2] CRAN (R 4.3.1)
 Formula         1.2-5   2023-02-24 [1] CRAN (R 4.3.0)
 fs              1.6.3   2023-07-20 [1] CRAN (R 4.3.1)
 gargle          1.5.2   2023-07-20 [1] CRAN (R 4.3.1)
 generics        0.1.3   2022-07-05 [1] CRAN (R 4.3.1)
 ggforce         0.4.1   2022-10-04 [1] CRAN (R 4.3.1)
 ggformula     * 0.10.4  2023-04-11 [1] CRAN (R 4.3.1)
 ggplot2       * 3.4.3   2023-08-14 [1] CRAN (R 4.3.1)
 ggrepel       * 0.9.3   2023-02-03 [1] CRAN (R 4.3.1)
 ggridges      * 0.5.4   2022-09-26 [1] CRAN (R 4.3.1)
 ggstance        0.3.6   2022-11-16 [1] CRAN (R 4.3.1)
 glue            1.6.2   2022-02-24 [1] CRAN (R 4.3.1)
 googledrive     2.1.1   2023-06-11 [1] CRAN (R 4.3.1)
 googlesheets4 * 1.1.1   2023-06-11 [1] CRAN (R 4.3.1)
 gridExtra       2.3     2017-09-09 [1] CRAN (R 4.3.1)
 gt            * 0.9.0   2023-03-31 [1] CRAN (R 4.3.1)
 gtable          0.3.4   2023-08-21 [1] CRAN (R 4.3.1)
 haven           2.5.3   2023-06-30 [1] CRAN (R 4.3.1)
 Hmisc           5.1-0   2023-05-08 [1] CRAN (R 4.3.1)
 hms             1.1.3   2023-03-21 [1] CRAN (R 4.3.1)
 htmlTable       2.4.1   2022-07-07 [1] CRAN (R 4.3.1)
 htmltools       0.5.6   2023-08-10 [1] CRAN (R 4.3.1)
 htmlwidgets     1.6.2   2023-03-17 [1] CRAN (R 4.3.1)
 httr            1.4.7   2023-08-15 [1] CRAN (R 4.3.1)
 janitor       * 2.2.0   2023-02-02 [1] CRAN (R 4.3.1)
 jsonlite        1.8.7   2023-06-29 [1] CRAN (R 4.3.1)
 knitr           1.44    2023-09-11 [1] CRAN (R 4.3.1)
 labeling        0.4.3   2023-08-29 [1] CRAN (R 4.3.1)
 labelled        2.12.0  2023-06-21 [1] CRAN (R 4.3.1)
 lattice       * 0.21-8  2023-04-05 [2] CRAN (R 4.3.1)
 lifecycle       1.0.3   2022-10-07 [1] CRAN (R 4.3.1)
 lubridate     * 1.9.2   2023-02-10 [1] CRAN (R 4.3.1)
 magrittr        2.0.3   2022-03-30 [1] CRAN (R 4.3.1)
 MASS            7.3-60  2023-05-04 [2] CRAN (R 4.3.1)
 Matrix        * 1.6-1   2023-08-14 [1] CRAN (R 4.3.1)
 mgcv            1.8-42  2023-03-02 [2] CRAN (R 4.3.1)
 mosaic        * 1.8.4.2 2022-09-20 [1] CRAN (R 4.3.1)
 mosaicCore      0.9.2.1 2022-09-22 [1] CRAN (R 4.3.1)
 mosaicData    * 0.20.3  2022-09-01 [1] CRAN (R 4.3.1)
 munsell         0.5.0   2018-06-12 [1] CRAN (R 4.3.1)
 naniar        * 1.0.0   2023-02-02 [1] CRAN (R 4.3.1)
 nlme            3.1-162 2023-01-31 [2] CRAN (R 4.3.1)
 nnet            7.3-19  2023-05-03 [2] CRAN (R 4.3.1)
 patchwork     * 1.1.3   2023-08-14 [1] CRAN (R 4.3.1)
 pillar          1.9.0   2023-03-22 [1] CRAN (R 4.3.1)
 pkgconfig       2.0.3   2019-09-22 [1] CRAN (R 4.3.1)
 polyclip        1.10-6  2023-09-27 [1] CRAN (R 4.3.1)
 purrr         * 1.0.2   2023-08-10 [1] CRAN (R 4.3.1)
 R6              2.5.1   2021-08-19 [1] CRAN (R 4.3.1)
 RColorBrewer    1.1-3   2022-04-03 [1] CRAN (R 4.3.0)
 Rcpp            1.0.11  2023-07-06 [1] CRAN (R 4.3.1)
 readr         * 2.1.4   2023-02-10 [1] CRAN (R 4.3.1)
 rlang           1.1.1   2023-04-28 [1] CRAN (R 4.3.1)
 rmarkdown       2.25    2023-09-18 [1] CRAN (R 4.3.1)
 rpart           4.1.19  2022-10-21 [2] CRAN (R 4.3.1)
 rstudioapi      0.15.0  2023-07-07 [1] CRAN (R 4.3.1)
 sass            0.4.7   2023-07-15 [1] CRAN (R 4.3.1)
 scales          1.2.1   2022-08-20 [1] CRAN (R 4.3.1)
 sessioninfo     1.2.2   2021-12-06 [1] CRAN (R 4.3.1)
 snakecase       0.11.1  2023-08-27 [1] CRAN (R 4.3.1)
 stringi         1.7.12  2023-01-11 [1] CRAN (R 4.3.0)
 stringr       * 1.5.0   2022-12-02 [1] CRAN (R 4.3.1)
 tibble        * 3.2.1   2023-03-20 [1] CRAN (R 4.3.1)
 tidyr         * 1.3.0   2023-01-24 [1] CRAN (R 4.3.1)
 tidyselect      1.2.0   2022-10-10 [1] CRAN (R 4.3.1)
 tidyverse     * 2.0.0   2023-02-22 [1] CRAN (R 4.3.1)
 timechange      0.2.0   2023-01-11 [1] CRAN (R 4.3.1)
 tweenr          2.0.2   2022-09-06 [1] CRAN (R 4.3.1)
 tzdb            0.4.0   2023-05-12 [1] CRAN (R 4.3.1)
 utf8            1.2.3   2023-01-31 [1] CRAN (R 4.3.1)
 vctrs           0.6.3   2023-06-14 [1] CRAN (R 4.3.1)
 viridisLite     0.4.2   2023-05-02 [1] CRAN (R 4.3.1)
 visdat          0.6.0   2023-02-02 [1] CRAN (R 4.3.1)
 withr           2.5.1   2023-09-26 [1] CRAN (R 4.3.1)
 xfun            0.40    2023-08-09 [1] CRAN (R 4.3.1)
 xml2            1.3.5   2023-07-06 [1] CRAN (R 4.3.1)
 yaml            2.3.7   2023-01-23 [1] CRAN (R 4.3.0)

 [1] C:/Users/thoma/AppData/Local/R/win-library/4.3
 [2] C:/Program Files/R/R-4.3.1/library

────────────────────────────────────────────────────────────────────